perm filename CMU[TLK,DBL]1 blob sn#201938 filedate 1976-02-14 generic text, type C, neo UTF8
COMMENT ⊗   VALID 00011 PAGES
C REC  PAGE   DESCRIPTION
C00001 00001
C00003 00002	.COMMENT !XGPCOMMANDS←"/PMAR=2200"
C00005 00003	This talk is about the process of discovery and invention in science.
C00015 00004	We've performed a  sequence of reductions, <chain SLIDE>  from primes
C00027 00005		INTRODUCE HEURISTIC GUIDANCE
C00037 00006	The first  step was to  list all the  concepts the system  would know
C00047 00007	What does  the  system actually  do, now?   From  a heuristic  search
C00059 00008	Just as with  the starting concepts, a  question arises as to  how to
C00075 00009		CARDINALITY EXAMPLE: OPTIONAL
C00112 00010		GOOD RUN
C00124 00011	SUPPLEMENTARY: Finding canonizing function for numbers
C00129 ENDMK
C⊗;
.COMMENT !XGPCOMMANDS←"/PMAR=2200";
.DEVICE XGP
.FONT 1 "BDR40"
.FONT 2 "NGR40"
.FONT 4  "BDI40"
.FONT 5  "BDR66"
.FONT 6  "NGR25"
.FONT 7  "NGR20"
.FONT 8 "SUP"
.FONT 9 "SUB"
.TURN ON "↑α↓_π[]{"
.TURN ON "⊗" FOR "%"
.TURN ON "@" FOR "%"
.PAGE FRAME 39 HIGH 83 WIDE
.AREA TEXT LINES 1 TO 38
.COUNT PAGE PRINTING "1"
.TABBREAK
.ODDLEFTBORDER←EVENLEFTBORDER←1000
.MACRO B ⊂ BEGIN VERBATIM GROUP ⊃
.MACRO E ⊂ APART END ⊃
.MACRO FAD ⊂ FILL ADJUST DOUBLE SPACE PREFACE 2 ⊃
.MACRO FAS ⊂ FILL ADJUST SINGLE SPACE PREFACE 1 ⊃
.FAS
.COMPACT
.SELECT 1
.PORTION THESIS
.PAGE←0
.NEXT PAGE
.INDENT 0
.TURN OFF "{∞→}"   
.BEGIN CENTER 
⊗5↓_Automating the Discovery_↓⊗*
⊗5↓_of Mathematical Concepts_↓⊗*


Text for a seminar at CMU
Monday, Feb. 16, 1976


⊗2Douglas B. Lenat⊗*
Artificial Intelligence Lab
Stanford University



.END

This talk is about the process of discovery and invention in science.

We'll look  at a  few examples of  discovery, use  them to propose  a
simple  model for theory  formation, and describe  a computer program
embodying that  model which  has carried  out  some new  mathematical
research on its own.

The first thing I want to  stress is the difference between solving a
given  problem  and thinking  it  up in  the first  place.   Contrast
playing Monopoly with inventing new games of the same quality.  Or
contrast solving the missionaries and cannibals problem with the
ill-defined reasoning which led to formulating it.

If I show you the definition of prime numbers, <SLIDE primes> you can
find some for me without any trouble.   Whoever first formulated that
concept performed a much harder synthesis.

	HOW TO ACCOUNT FOR THE DISCOVERY OF PRIMES

How  might we  account for a  discovery, like  prime numbers?   Say a
ten-year-old has just heard  of these numbers,  and he asks you  "How
did  anybody  ever think  up  that  idea?"  So  you have  to  quickly
rationalize  this discovery; you have to  create a plausible scenario
in which someone  decides to look at  them, and then decides  they're
interesting and worth naming.


Here's one such scenario of their discovery:

A researcher  already knows about  counting and arithmetic,  and he's
just been playing with the notion of "divisors-of" of a number.   One
rule of thumb he  sometimes follows when devising new  theories is to
consider extreme cases. <SLIDE goto extremes> In more precise terms,
this rule  says  that if  we've  just  been studying  an  interesting
function f  which takes elements  of set A into  those of set  B, and
there  is some extremal member of B,  it's worth our time to consider
which elements from  A map into  that element. The more unusual an item
is, the more time you should spend investigating its inverse image.

Say our sets A and  B are the natural  numbers.  Extremal members  of
that set are the smallest numbers: 0, 1, 2, 3, or 4.  If the function
is "number-of-divisors-of",  the rule of thumb  says to construct and
investigate the set of numbers with no divisors, with only 1 divisor,
with 2 divisors  <SLIDE n divisors>, with three divisors,  and so on.
A  priori, any or  all of these  sets could be interesting.   It will
turn out that ⊗4this⊗* set  (2 divisors) is the most  interesting and
useful one. Why?
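To make the extremals heuristic concrete, here is a small sketch of it in modern Python (my rendering for this text, not the system's INTERLISP): the inverse image of the extreme value 2 under number-of-divisors-of is exactly the set we are after.

```python
# A minimal sketch (not the system's code) of the "consider extreme
# cases" heuristic applied to the number-of-divisors-of operation.

def number_of_divisors(n):
    """Count the divisors of n by trial division."""
    return sum(1 for d in range(1, n + 1) if n % d == 0)

def inverse_image(f, extreme_value, candidates):
    """All candidates that f maps onto the given extreme value."""
    return [x for x in candidates if f(x) == extreme_value]

numbers = range(1, 50)
# Numbers with exactly 2 divisors turn out to be the primes.
print(inverse_image(number_of_divisors, 2, numbers))
```

Running the same sketch with extreme value 1 picks out just the number 1, another of the a priori candidate sets.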

After examining  some more empirical data, the  researcher may notice
that all numbers seem to factor uniquely into numbers drawn from this
set.  He'll  conjecture the unique factorization theorem  <SLIDE UFT>
and notice that it  can be made much less awkward if we give this set
a name, <SLIDE: Primes overlay> say PRIME NUMBERS.
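The empirical data behind that conjecture can be sketched in a few lines of Python (a hypothetical illustration, not part of the system): every number greater than 1 breaks into factors drawn from the 2-divisor set, in only one way up to order.

```python
# A sketch of the observation behind the unique factorization
# conjecture: factoring numbers into members of the 2-divisor set,
# i.e. the primes.

def prime_factorization(n):
    """Greedily divide out the smallest factors; the result is the
    multiset of primes whose product is n."""
    factors = []
    d = 2
    while n > 1:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    return factors

for n in [12, 30, 97]:
    print(n, prime_factorization(n))
```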

	WE DID A PROBLEM REDUCTION

We've just reduced our problem from "how in the world  could somebody
invent the notion  of prime numbers" to the somewhat  simpler task of
"how  in the world could somebody  invent the notion of divisors-of".
<SLIDE Reduc 1> The reduction was accomplished by citing a heuristic,
which said:

⊗4"Investigate  those items  having unusual  or extreme  images under
some operation."⊗*

In this case, our possible items  are numbers, the extreme image  was
the    very   small    number   "2",    and    the   operation    was
number-of-divisors-of.   The inverse image of 2  turned out to be the
set of numbers we know as the primes.

"Looking for extreme cases"
is a  pretty  general  heuristic.

It's not a  rule of inference,  just a rule  of thumb; it  doesn't
guarantee  anything, in  the  sense that modus  ponens <SLIDE overlay>
guarantees  to  preserve  validity.    It  certainly  isn't
applicable or useful all  the time.   And yet it's worth  remembering
and using  because it frequently  leads to interesting,  valuable new
concepts  to investigate.  In  the long run,  using this heuristic is
cost-effective.

Say the ten-year-old persists  in asking why; why did  anybody think
of this  concept, divisors-of? By now  you know what to  do. You grab
some rule of thumb, and use it to rationalize that discovery.

In this  case, you  might  realize that  a  reasonable tactic  is  to
investigate the inverse of each interesting  relation.  That accounts
for this discovery,  <inverse OVERLAY> reduces it to the discovery of
the concept  of multiplication,  because looking  at  the inverse  of
multiplication means discovering this concept, divisors-of.

If you  can stand it, you  might try to continue  this reduction even
further.   So  you want  to rationally  explain how  ⊗4this⊗* concept
might have been discovered.

Suppose we knew how to count the number of elements in a set, and how
to take  the cross-product of  two sets.   Then we might  display our
knowledge pictorially thus, < read  thru: SLIDE: square> and a  quite
natural  question is  "what  is this  missing  operation?".   We  can
compute it easily,  at least in this direction. <OVERLAY: square> For
example, if we take these two sets, their sizes are 2 and 3. But this
is  their cross-product,  and  ⊗4its⊗* size  is 6.    So our  mystery
operation  takes the  numbers  2,3 and  returns the  number 6.   Sure
enough,  this   operation  will  turn   out  to  be   multiplication.
Mathematicians call  this activity chasing-around-diagrams,  and it's
one of their favorite games.
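In Python the diagram-chase might look like this (an illustrative sketch with my own names, not the system's): we already know counting and cross-product, and we compute what the missing corner of the square must do.

```python
# A hedged sketch of "completing the square diagram": counting and
# cross-product are known; the mystery operation on numbers is whatever
# makes the diagram commute.
from itertools import product

def count(s):
    return len(s)

def cross_product(a, b):
    return list(product(a, b))

def mystery_op(m, n):
    """Given sizes m and n, return the size of the cross-product of
    two sets having those sizes."""
    a = list(range(m))
    b = list(range(n))
    return count(cross_product(a, b))

print(mystery_op(2, 3))   # behaves just like multiplication
```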

Here multiplication was discovered as a numeric analog to cross-product.
An alternative, and perhaps more common derivation, is to consider
multiplication as repeated addition.

.SELECT 1

We've performed a  sequence of reductions, <chain SLIDE>  from primes
to divisors-of to multiplication to counting and cross-product.  Each
step was  justified  by some  rule of  thumb:  Here it  is  "consider
extremals",  here  it is  "consider  the  inverse of  an  interesting
relation", here it is "complete the square diagram".

You  could continue  this analysis,  but at  some point  the concepts
you'd have to deal with  are so elementary that you feel it  makes no
sense to talk about  "discovering" them.  Your strategy then changes,
and instead of trying  to think of  a heuristic, you  try to get  the
10-year-old to leave the room.

We can only  repeat this analytical  process until our  original task
has  been reduced to  the "discovery" of  concepts we  consider to be
conceptual primitives.  Of course  this is  a  personal barrier,  and
varies from individual to individual.

When we  hear of a  new discovery, we mentally  try to analyze  it in
this  fashion ourselves.   We  try to  perceive a chain  of concepts,
stretching  from that  ⊗4latest⊗*  discovery  backwards, not  to  our
conceptual  primitives,  but at  least  all the  way  to concepts  we
already are familiar  with.  If  we succeed right  away, we  probably
think that the discovery  wasn't so hard, that we could  have done it
ourselves.

Let me  digress for a moment, and discuss  2 reasons why we sometimes
⊗4fail⊗* to account for a discovery:

.BEGIN FILL INDENT 4,8,0 PREFACE 0

(1) because we're missing a  heuristic which is crucial to this  line
of development;

(2) because the chain is too long for us to construct easily.

Why might we lack some heuristic? Either because we're not experts in
this  field, or, occasionally, because the  researcher has used a new
heuristic.

Why might the chain be too long for us to  perceive?  Usually because
we  simply  weren't  aware of  all  the  current  concepts which  the
discoverer had at his disposal.  So  our chain had to be much  longer
than his, it had  to reach further down before  hitting concepts that
⊗4we⊗* knew.   Occasionally, because the discovery  ⊗4was⊗* in fact a
major advance over pre-existing knowledge.

.END

.SELECT 1


If you're a novice in some field, then you'll suffer from  both these
deficiencies,  and you  may find  all the  discoveries in  that field
amazing.   Often we find two friends,  like a nuclear physicist and a
molecular geneticist, who can't converse about their research.

Norbert Wiener was always decrying this kind of specialization; he
used to advocate interdisciplinary research. This simple model shows
us why it might be worth the trouble to learn all those alien
concepts.  One big advantage you have if you enter a field from
another speciality  is that you bring with you  a whole bundle of new
heuristics. Many of them will be inapplicable, but if even one or two
can carry over,  then you're using techniques that no  one else uses,
so  you may  make some advances  right away, as  soon as  you pick up
enough of this jargon, these  concepts to work from.  You  also bring
with you many  new concepts, and that means you  have some unique raw
materials to draw analogies from.

A friend of mine was able to breeze through organic chemistry because
he knew about means-ends  analysis and found it could  be used there.
Phrased  in  terms  of  operators,  goals,  and pre-conditions,  each
chemical  synthesis  problem  became  a  simple  means-ends   search.
Meanwhile, his classmates were relying on rote memorization.

This is the end of the digression. From now on, we'll assume that the
researcher  is working in his  own speciality, that  he knows most of
these concepts and heuristics.

Very often, a  new discovery is  really just a  couple steps in  this
kind of process, one or  two heuristic applications away from what is
already known.  That is  why it's often easy  for us to believe  that
there wasn't really so much effort needed to  make the discovery.  If
we  already know this  concept, and  this heuristic, it  almost seems
obvious that someone would make this step.

Why then don't we  make new discoveries every  few minutes?  Why  are
even some 1-step advances worth publishing?   The answer can be found
by seeing how (according to this model) new discoveries are made.

We're talking  now about reversing the flow of time here.  Instead of
analyzing a discovery by  breaking it into simpler and  simpler ones,
we're now progressing in this  direction, trying to grow new nodes on
this tree, using whatever heuristics are available.

The researcher has a bunch of concepts that he knows about, say a few
thousand.  I  won't even consider  his ⊗4legal⊗* choices,  since they
are quite  literally infinite in number.  Say  he knows a few hundred
heuristics he can apply.   In any  situation, then, he may  literally
have  a million  alternatives to  consider, <OVERLAY  of red  lines>.
These are not his ⊗4legal⊗* choices, but his ⊗4plausible⊗* ones; each
and every one is justified by some heuristic, like this link is.

But nature is  not kind to  us, and only  a very small percentage  of
those new  concepts will turn  out to be  interesting.  What  is even
worse, you can't whiz through this space, stopping at the interesting
concepts, because it may  require months of effort to  decide whether
the concept you're studying is worthwhile or a dead-end.

Even here, where we have just a few concepts and heuristics, the tree
gets bushy  when we try  to go  in this  direction.   We could  apply
"looking at  inverse" here, (cross-product) to  decide to investigate
the  kind  of  operation  we  normally  call  projection.    Or  here
(divisors-of) even if we decide  to look for extremals, which  way do
we go?   Why not  examine numbers which  have very many,  rather than
very few, divisors?

The analysis procedure was a search for a path from a given node back
to primitives, but the  synthesis procedure is blind, doesn't  have a
well-specified goal, so it is combinatorially more explosive.

So it's  harder to make a valuable  new discovery than to rationalize
how someone might plausibly have discovered something  you've already
seen.    This explains  why  discoveries  sometimes seem  so  obvious
⊗4after the fact⊗*, even to their inventor who may have labored years
on the problem.

The same let-down  occurs when  a magician shows  you how he  manages
some  magic trick. For  example, at  talks like  this one,  where the
speaker states some amazing behavior  his program exhibits, but  when
he explains how it works, it loses its impact.

.SELECT 1

	INTRODUCE HEURISTIC GUIDANCE

The  standard  Artificial  Intelligence  method   for  muffling an
explosion  of red arrows is  to introduce purple  and green heuristic
guidance.  <2nd OVERLAY:  green/purple>: purple strategies  highlight
the most promising nodes to work on, and wiggly green arrows
suggest one or two operators to try first on a given node.

Recall
that each red operator is a rule of thumb. So the green strategies tell
us, in a given situation, which heuristics are best to apply.

<Remove Red-lines overlay>

Notice  the references  to  time or  situations that  have  crept in.
Instead of just knowing a rule of thumb like this one
<SLIDE: heurs.; point to line 1>
the system would have a more sophisticated piece of knowledge like
<point to line 2>

The left-hand-sides of these situation-dependent  rules
will typically trigger on
some experimental data that was just collected.  Here, it was running
a predicate  on random arguments  and seeing  what percentage of  the
time it returned the value "True."
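Such a trigger might be sketched as follows (the sampling scheme and the 5% threshold are my inventions for illustration):

```python
# A sketch of the experimental trigger just described: run a predicate
# on random argument pairs and measure how often it returns True.
import random

def true_rate(predicate, trials=1000, seed=0):
    """Fraction of random trials on which the predicate holds."""
    rng = random.Random(seed)
    return sum(predicate(rng.randrange(100), rng.randrange(100))
               for _ in range(trials)) / trials

equal = lambda x, y: x == y
rate = true_rate(equal)
if rate < 0.05:
    print(f"equality holds only {rate:.1%} of the time: "
          "consider generalizing it")
```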

The  basic  idea  so  far  is  that  discovery  can  be  successfully
represented  as  a heuristic  search  procedure: <SLIDE:  3  tasks> A
continual expansion of a  tree of concepts and relationships,  guided
by some judgmental criteria, some informal rules of thumb.

Your memory of this talk will fade with time, but even if it drops down
to  just four words, I'd be happy  if they were ⊗4these⊗* four words:
Discovery as Heuristic Search.

	ONE CAN WRITE A THY. FORMATION PROGRAM

If this is true, one should be able to write a computer program which
proposes new  concepts in some scientific domain.   After picking the
field in which  the system  will work,  there are  3 standard  things
you'd have to do, the same things you always  have to do to implement
a heuristic search:

.BEGIN INDENT 0,3,0 PREFACE 0

(1) specify the initial core of knowledge that it will start with,

(2) specify the legal ways of adding to that knowledge

(3) collect the strategies and tactics that experts in that field use
when trying to form theories.

.END

	MATHEMATICAL THY. FORMATION

If anyone asks  at the end, I'll discuss why  mathematics is an ideal
field  in which to  try out these  ideas. For now,  assume that we've
settled on math.  Before  carrying out these 3 steps, I should at least
mention what I mean by a ⊗4mathematical concept⊗* and a ⊗4theory⊗*.

A mathematical  theory <SLIDE: thy> is built  on a ↓_⊗2FOUNDATION⊗*_↓
of primitive,  undefined ↓_objects_↓,  and ↓_operators_↓,  plus  some
postulates  and  axioms  which  interrelate  them.    Each  of  these
qualifies to be called a mathematical ↓_concept_↓.

Onto this foundation is built a tower of theorems -- statements which
are  implied  by  the  foundation  --  plus  some
definitions of new concepts in terms of preexisting ones.

This is  the final,  breath-taking architecture that  we see  in math
textbooks.   <math  textbook-order slide>  All undefined  objects and
axioms are  stated, and  then an infinite define-state-prove  loop is entered.
Occasionally, it's broken  by  some  post-facto motivation  or  real-world
application. This entire process never involves backtracking.

What they ⊗4don't⊗* show you in textbooks is the unaesthetic scaffolding that  was
required to erect that tower: that scaffolding is empirical induction
and good old-fashioned search, with back-tracking and everything.

Yes,  Virginia,  mathematics  is  an  empirical  science. Most of you know that
already, but you learned it from hard experience, not from any
algebra textbook.

Math research has
very little to do with the smooth flowing proofs you find there. It's
more like experimental physics,  where you gather some data, look for
regularities, and induce some conjectures.  If they can't be  phrased
concisely with the existing terminology, that's a  clue that some new
definitions  should be made.   Just like  we did  near the beginning,
introducing the term "prime numbers" to simplify the statement of the unique
factorization theorem.

<SLIDE:  research-order> So  if anything,  math research
proceeds in  almost the  opposite order from  the way  it is  finally
reported in books and journals.

This  is  done  by   empirical  induction  (1st  line),  by  studying
experimental data and trying to perceive some regularities in it.

Most of the mathematician's empirical forays are fruitless, but there
is no way to know this before carrying them out.  His only guidance
is his intuition about what to investigate next, his purple and green
heuristic strategies.

What I'm going to describe now  is a computer program which uses  the
paradigm of heuristic search  to develop simple mathematical theories
in this same way.
I'm talking about a real system,
written in INTERLISP.  It lives on the PDP-10
at SUMEX, and occupies 100k.  All the discoveries I have mentioned and
will mention have been made by the system working by itself.

<3tasks,  revisited SLIDE>  Recall that  there is  a standard  set of
tasks to perform, to create such a system.

.SELECT 1

The first  step was to  list all the  concepts the system  would know
about initially. That  is, a bunch of primitive notions which are the
starting state  of the system's  knowledge.   But how  do you  decide
which concepts to include?

Three kinds of criteria came to mind:

First, a ⊗4↓_complete_↓⊗*  set of concepts, from  which all knowledge
in the universe could be derived. But this wouldn't fit into 256k.

More  seriously, I tried to  get a ⊗4minimal⊗*  set of concepts, only
starting with concepts  which are indispensable, fundamental  in some
absolute sense.  But as  we saw earlier, the conceptual primitives of
you or me or the system are purely personal.  Until recently, no  one
realized that arithmetic could be derived from set theory.

Scrapping both complete and minimal criteria, the  next one I thought
about was to isolate just those concepts which young children seem to
have, which Piaget might have called ⊗4pre-numerical⊗*.   This turned
out to be a quite nice set, <SLIDE concept> and these are in fact the
concepts the system actually started with.

Included here are static structural concepts like sets and relations,
and active ones like substitute, union, reverse, compose 2 relations,
and so  on.  In  all, about 100  concepts were called  for. Note that
there is no notion of numbers or proof.

At a  slightly  deeper  level,  we  have  to  decide  precisely  what
information the system will know about each of these concepts.

Certainly we  should provide a  ↓_definition_↓ or  some other way  to
distinguish among the concepts.

For each operation, we should also mention its ↓_domain and range_↓.

Perhaps  we  know  of some  ↓_abstract  representation_↓  in which  a
concept can be quickly manipulated analogically (e.g.,  Venn diagrams
for the concept of Sets).

Also, to  help the  system decide  what to  do, it  would be  nice to
supply some  kind of rough judgment of the worth of each concept: its
interest, its usefulness, its  simplicity, symmetry, etc.   Now we're
veering from  science into  metaphysics, so let  me just say  that we
attach a number to each concept,  which is supposed to represent  its
worth.  Empirically, the  precise numbers turn out not  to affect the
system's behavior much.

There  are  many  other  facets  to  each  concept,  <SLIDE  facets>,
including Examples, Generalizations, Specializations, Analogies,  and
so on.

This set  of facets  is fixed  once and  for all.   Each  concept can
possess  any or all of these facets. But  that's all we can ever know
about a concept.  A concept  ⊗4is⊗* simply a  bunch of facets,  drawn
from  this  fixed  set.   This  idea  is  what  I called  the  BEINGS
representation of knowledge a few years  ago, and is subsumed by  what
is now known as FRAMES.

In the LISP program implementing these ideas,  each concept is stored
as an  atom, and its facets are implemented  as the properties on its
property list.
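In modern terms the same representation could be sketched with a dictionary (a rough Python analogue of an atom's property list; the facet names follow the slide, the contents are invented):

```python
# A rough sketch of the "concept = bundle of facets" representation.
# In the actual INTERLISP system each concept is an atom whose facets
# are properties on its property list; a dict plays that role here.

COMPOSE = {
    "name": "COMPOSE",
    "domain-range": "(relation, relation) -> relation",
    "worth": 300,                        # rough numeric estimate
    "examples": [],                      # blank facet, filled in later
    "specializations": ["COMPOSE-WITH-SELF"],
}

def facet(concept, name):
    """Fetch one facet; blank facets are simply absent or empty."""
    return concept.get(name)

print(facet(COMPOSE, "worth"))
```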

Let's  look  at  one  particular  concept,  and  see  what  kinds  of
information could  go into  each facet.   We'll consider  the concept
COMPOSE, which  refers to the composition of two relations. We define
AoB(x) as A(B(x)). That is,  apply B and then apply A to  the result.
<COMPOSE  slide>  <go  over each  facet>.    Here  is an  abbreviated
description of that concept.

Several equivalent definitions are provided. Some are computationally
efficient, and others are slow but especially easy for the system
to analyze.   Each definition is an  executable LISP predicate, which
takes 3 arguments  and tries to  see whether  the composition of  the
first two is equal to the third.
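One such definition-predicate might look like this (a hypothetical Python rendering; the actual definitions are executable LISP predicates), here checked on a sample of arguments rather than proved:

```python
# A sketch of one definition of COMPOSE: a predicate of three
# relations that tests whether the composition of the first two
# equals the third on every sample argument.

def is_composition(A, B, C, sample):
    """Does A-compose-B behave like C on all the sample arguments?"""
    return all(A(B(x)) == C(x) for x in sample)

double = lambda x: 2 * x
increment = lambda x: x + 1
double_then_increment = lambda x: 2 * x + 1

print(is_composition(increment, double, double_then_increment, range(10)))
```

Note the order matters: composing the other way around, double-then-increment fails the test.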

Several  algorithms are  provided,  again  exhibiting this  trade-off
between speed and transparency.

The specializations  facet points to another concept which is similar
but only composes a relation with itself.

An ⊗4example⊗* of Composition is provided here.

One conjecture stored here is that composition is  associative.  This
was discovered by the system,  not put in by me.  Since associativity
isn't known  to  the  system,  the entire  statement  of  that  bulky
relationship must be stored.

Down here we have some heuristics for  dealing with compositions.  An
explanation  of  why it's  good for  anything,  how you  can  spot an
interesting compostion  (e.g.,  if the  d-r  of the  composition  are
equal), hints  for filling in  certain facets of any  new composition
which   gets  created,  and  something   worth  checking  about  each
composition.

Notice that the format of each kind of facet  varies from productions
<heurs> to little programs <algs>, to bits of declarative data <d-r>.

Let  me put back  the list of  concepts again,  and also our  list of
facets.<SLIDE>

What does  the  system actually  do, now?   From  a heuristic  search
viewpoint,  the  primary activity  of  the system  is  to  create new
concepts. When  it does so,  however, most  of their  facets will  be
blank, so  that in  practice most  of the system's  time is  spent in
fleshing  out some of the concepts  which it has recently discovered,
filling in some of their facets.

So there are two ways that the system expands  its knowledge. One way
is to find  out more about the already-known  concepts; the other way
is  to  define  new  ones.    <SLIDE  2ways  to  grow>  

In  the  LISP
implementation, that  translates to filling out the properties  of existing
data stuctures (here  some examples of Compositions have been found),
and sometimes creating entirely new data  structures
(Here  the  system  takes  one
particular  example of  Compose and  promotes it 
from being just one entry on one facet of one concept, into beinga full-fledged
concept with facets of its own).

What happens when a new concept is created?  Well, at the moment of its
birth, the system knows
its  definition, and  something about its  worth.  Also  how it
connects to some existing concepts.   
So a few seconds are spent filling in all the facets whose contents are obvious.
Once it's created, most of  its
facets will still be empty, so there will be many new things for the system
to do.

Since the
"fleshing out" activity is quite expensive, it would be a  mistake to
just try to fill out all the facets of all the concepts. 
For example,
the system  starts off with 100 concepts.  Of all the possible facets
they might possess, only a few are filled-in for each one; so there are
2000 blank facets initially.
It would take days of cpu time for them all to get filled in. And
each time a new concept was  proposed, an hour might be shot  working
just to flesh it out completely.  The system must manage its time more carefully.

There must be some "fine structure"  to the process of  filling out
the  tree of  concepts.   This is  a non-standard  problem, not
faced during your typical heuristic search. We can't permit the system to
completely fill out each new node it wants to look at.

The solution I use is to dynamically optimize the expected value of what the
system is doing.  By that I  mean that it must  look at a bunch  of very
plausible tasks, and choose the most promising of them to execute.
Each task is at the level of filling out a particular blank facet of a particular
concept.

The  LISP implementation of this  idea is to  have a global job-list,
<SLIDE joblist> an agenda  of suggested activities to do,  an ordered
list of  candidates for the system's attention.   Each entry consists
of a specific task for the system to work on, a priority value, and a
list of reasons why this is a reasonable task to do.
Except for this list of symbolic reasons, this is the same as the agenda
mechanism proposed by Winograd and Bobrow for their new language, KRL.

At some moment,  there  might be many entries on the agenda, one
indicating   that   the   system   could   Generalize  the   relation
Object-equality  somehow,  one suggesting that the system fill-in
some  examples   of  the   newly
discovered concept Primes, and so on.

This agenda governs  the top-level control structure  of the program.
The system picks the highest priority job off this list, and executes it.

In the process of performing a job, 3 kinds of things can happen:

.BEGIN PREFACE 0 INDENT 3

(1) blank facets may get filled in,

(2) new concepts created, and

(3) new jobs added to the agenda.

.END

In practice, so few jobs ever get executed that we only keep around the
top 100 or so at any moment.
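The top-level loop over the agenda can be sketched like so (task names and priority numbers are invented for illustration, not the system's entries):

```python
# A minimal sketch of the agenda mechanism: an ordered list of
# suggested tasks, each carrying a priority and symbolic reasons.
import heapq

def propose(agenda, task, priority, reasons):
    """Add a candidate task; the heap keeps the best one on top."""
    heapq.heappush(agenda, (-priority, task, reasons))

def next_task(agenda):
    """Pop the highest-priority task, as the top-level loop does.
    Executing it may fill in facets, create concepts, and propose
    new tasks back onto the agenda."""
    neg_priority, task, reasons = heapq.heappop(agenda)
    return task, -neg_priority, reasons

agenda = []
propose(agenda, "fill in examples of Primes", 800,
        ["Primes was just discovered"])
propose(agenda, "generalize Object-equality", 600,
        ["equality rarely holds on random pairs"])
print(next_task(agenda)[0])
```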

Each new job gets entered by a
heuristic, for example  this one  is suggested by  a heuristic  which
recognized that  there  were too  few random  pairs  of objects  that
turned  out  to be  equal.   At  the  moment it  is  entered, whoever
suggested the job will know  some pretty definite reasons for why  it
might be worth doing.   So he attaches a list of reasons  to it.  That
proposer, that heuristic, also assigns to each reason a  numeric 
rating between 0 and
1000.  

One global formula then looks at the job  and the reasons, and
assigns it a single priority rating (also from 0 to 1000).
<SLIDE: Formula> That formula was just guessed at as
a first pass, and it's worked well  enough not to tamper with it.  It
involves  squareroots  and dot-products  of  vectors and  isn't really
worth delving into.

Each concept, each facet,  each job has a number  attached to
it.   This formula uses them  to attach an  interestingness factor to
any given job on the agenda.


Each reason is actually a token, but as you can imagine they're quite
useful for keeping the human user  aware of what the system is  doing
and why.

A surprisingly  important mechanism  comes into  play when  a job  is
proposed which turns  out to be ⊗4already⊗* on the agenda.  <OVERLAY:
new job>  If the  job  is being  proposed for  pretty much  the  same
reasons  as before,  then  its priority  rating  is incremented  only
slightly.   But if  it is being  proposed for a new  reason, then its
priority is increased tremendously.
This is the practical importance of maintaining reasons for each job.
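That re-proposal rule might be sketched as follows (the increments 10 and 200 are invented; only the asymmetry between old and new reasons matters):

```python
# A sketch of the re-proposal mechanism: proposing a job again for
# the same reason barely raises its priority, but a genuinely new
# reason raises it a lot.

def repropose(job, new_reason):
    if new_reason in job["reasons"]:
        job["priority"] += 10          # same reason: slight increment
    else:
        job["reasons"].append(new_reason)
        job["priority"] += 200         # new reason: big boost

job = {"task": "generalize Object-equality",
       "priority": 600,
       "reasons": ["equality rarely holds on random pairs"]}

repropose(job, "equality rarely holds on random pairs")
repropose(job, "a generalization would simplify a conjecture")
print(job["priority"])
```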

Notice how nondeterministic  the flow of control  is.  The system  looks
over the job-list, picks the job with the highest priority, and tries
to execute it.  That priority rating then dictates how to manage  the
system's resources: for example,  the number of cpu seconds  to spend
before giving up and trying the next entry on the agenda.
The number of list cells a facet can occupy before quitting, and so on.
(This idea was inspired by Carl Hewitt's notion of activation energy.)

.COMMENT OMIT IF SHORT ON TIME;

We are assuming that there  is a self-consistent set of computational
rules  for manipulating the  estimated worth of all  the concepts and
jobs. 

In my own modest way, I  call this scheme a zeroth-order calculus  of
interestingness factors.   Predicate calculus  tries to  preserve and
manipulate validity.   Carnap and Hintikka's Lambda-calculus lets you
manipulate certainty  factors.   This  first pass  at a  calculus  of
priority  values tries  to  preserve  and manipulate  interestingness
factors or estimated worth. Much work remains to be done on it before
it deserves more than just a mention in passing.

Let me warn you  about using this scheme, however.  The whole idea of
a global  priority  value is  quite  dangerous.   
Comparing the worths of "compose" and "sets" is
really
comparing apples to oranges.

I think that this scheme works adequately here only because
neither the  heuristics  nor  the  jobs are  strongly
coupled; they are  all pretty  much independent  of each  other.

For example, it
doesn't really matter which order the top 2 jobs get executed in; 
no one cares whether we look for primes right before generalizing equality,
or right afterwards.
What ⊗4does⊗*
matter is  that they are  near the  top, not near  the bottom, of the agenda.
That's why a crude kind of evaluation function is all you need.

.SELECT 1

Just as with  the starting concepts, a  question arises as to  how to
collect the starting ⊗4heuristics⊗* that will guide the system.

The standard  way to do this, in  any domain-specific knowledge-based
system, is to have some experts from that field introspect on all the
clues they use, all the knowledge that guides their research efforts.

Unfortunately, most scientists' heuristics are demon-like.
That is, they only occur to our conscious minds in situations where
they might plausibly be used. We grossly underestimate the  number of
such tactics that we use every day.  We just never access them out of
context.

What  I did was to  consider each facet of each  concept in turn, and
try to think of all the heuristics that I might  use in that context.
In that way, several hundred heuristics were found.

They fell into  a few major types: Some of  them suggest new concepts
to consider <SLIDE:  heurs> (like  "look at the  inverse") but  these
rarely lead to  cost-effective results.   Much more valuable  are the
ones which  give hints for ⊗4when⊗* to do  things ("if you can't find
any examples of when a predicate is satisfied, then try to generalize
it").

Also valuable are those which tell  ⊗4how⊗* to do something ("to fill
in  examples of A, find an operator whose range is A,
and apply it").  

Finally, there are heuristics for filling in new heuristics (like
"after using the `extremals' heuristic to derive a new subset b↑-↑1
of A, then add the heuristic that `b↑-↑1 is a good set to use when
working with or trying to derive conjectures involving the
operations f and f↑-↑1' ").  That's not clear, so let me give
you an example of how we'd use this.

The system discovered primes, using the look-for-extremals heuristic.
⊗4This⊗* rule then said to create  a brand  new
heuristic which read:

.ONCE INDENT 0 PREFACE 0

⊗2"Primes  are useful to  generate and to  test conjectures involving
multiplication and divisors"⊗*

Because f and f↑-↑1 are multiplication and divisors, b↑-↑1 is the set
of prime numbers, and A is the set of numbers.

There are  very few of
these that I know of, but each one can have a dramatic impact on  the
system's behavior.

By and large, the valuable heuristics all have the form "in situation
S,  consider doing X".   In my  system, therefore,  each heuristic is
written as a production rule.

Here's how this heuristic (if few exs, genlize)
would actually have to be stated <SLIDE: detailed heur:genl>.
Note that even at this level of detail, we're still talking one level above LISP.
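As a deliberately simplified illustration, here is how such a production rule might look in Python. The facet names, the threshold of 2, and the priority value 800 are my inventions, not AM's:

```python
def few_examples_rule(concept, agenda):
    """IF the concept is a predicate and very few examples of it are known,
       THEN propose a job to generalize its definition."""
    if concept["kind"] == "predicate" and len(concept["examples"]) < 2:
        reason = f"very few examples of {concept['name']} were found"
        agenda.append((800,
                       f"generalize the definition of {concept['name']}",
                       reason))

# the situation after random testing turned up almost no equal pairs
equal = {"name": "EQUAL", "kind": "predicate", "examples": ["NIL=NIL"]}
jobs = []
few_examples_rule(equal, jobs)
```

The left-hand side is the "in situation S" test; the right-hand side doesn't act directly, it just posts a job (with its reason) on the agenda.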

.SELECT 1

With hundreds of heuristics, the system can't afford to run through them all
in each situation.
Note that this is ⊗4not⊗* a  standard problem faced by most heuristic
searches.    Like the  human  mathematician, the  system  should only
access a  strategy in those  situations where  it could plausibly  be
used.     Each  heuristic  was  generated   while  contemplating  one
particular concept, so we keep it tacked on to that particular concept.

For example, some of the heuristics tell whether a given  composition
is interesting or not. There's no reason  to access them unless we're
trying  to judge  a composition.   So  we keep  them attached  to the
COMPOSE concept, as just another facet.

<SLIDE:  genl branch>  Say  we  want  to  see  how  interesting  this
particular operation  is.   Then we  could use  evaluation heuristics
gathered from  any of these concepts. 
Any reason why an operation might be interesting will also be valid here.

The system would ripple upward,
in the direction  of increasing generality, collecting  heuristics as
it went.   They would be  evaluated, and the average  result would be
taken as the worth of this concept.

This is why the relevant heuristics were listed before as a facet  of
each concept,  like domain/range, examples,  and definition.   That's
also  a  nice symmetric  way  to  think of  things.    Formally, it's
equivalent to  having another  test  on the  left-hand side  of  each
heuristic  rule,  each production,  to  select  for the  concept  the
heuristic  is associated with.  But computationally, it  is much more
efficient to keep them separated.

If we're dealing with a bunch of concepts, we  ripple upward from all
of them.  All the heuristics that are gathered up get dumped together
into one little  production system, which  is then run  to produce  a
value.  
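Here is a sketch of that rippling process. The generalization links and the numeric scores below are invented for illustration; the point is only the mechanism of gathering heuristics upward and averaging their verdicts:

```python
# assumed generalization links: concept -> its immediate generalizations
GENL = {
    "prime-pairs": ["primes"],
    "primes":      ["numbers"],
    "numbers":     ["objects"],
    "objects":     [],
}

# assumed interestingness heuristics attached to some concepts,
# each returning a score between 0 and 1
HEURISTICS = {
    "primes":  [lambda c: 0.8],
    "numbers": [lambda c: 0.6],
    "objects": [lambda c: 0.4],
}

def ripple_up(concept):
    """Collect the heuristics of a concept and of all its generalizations."""
    gathered, frontier, seen = [], [concept], set()
    while frontier:
        c = frontier.pop()
        if c in seen:
            continue
        seen.add(c)
        gathered.extend(HEURISTICS.get(c, []))
        frontier.extend(GENL.get(c, []))
    return gathered

def worth(concept):
    """Run all the gathered heuristics and average their results."""
    scores = [h(concept) for h in ripple_up(concept)]
    return sum(scores) / len(scores)
```

Because the heuristics barely interact, the order in which the gathered rules fire doesn't matter; a plain average is good enough.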

This works in practice because  the heuristics rarely couple, in the sense that
Newell defines coupling; the order in which they are executed is not
too important.
They  don't interact much. So  it's trivial to assemble  them and run
them together.
Simple schemes like this only work when the components of the system
are very independent, when the problem was easy to begin with.
Tight interdependence would require a more sophisticated approach.

.SELECT 1

By the way, using one of the system's own heuristics, if something is
interesting then it's worth naming it. I think the program has become
just interesting enough now to warrant a name; let's call it AM.

Let's finally take a look at the program in action.

.SELECT 1

	CARDINALITY EXAMPLE: OPTIONAL

This is  an actual excerpt  from a session  with AM,  as it first  discovers
Cardinality,  the concept of  numbers.   AM already knew  about sets,
bags, lists, etc., and equality of 2  sets.  A bag is a multiset,  an
unordered structure with repeated elements allowed.

<Cardinality slide>

I'll read through it once to  show you that AM really performs magic,
then we'll  peek behind the curtains to see that AM is really no more
than the  straight-forward  data-structure  expander that  I've  been
describing.

(1) After testing random structures to see if they are equal, AM
concludes that it would be worthwhile to generalize the predicate
EQUAL. The user asks "why" and AM mentions that few randomly selected
objects turned out to be equal to each other.

(2) Two generalizations of EQUAL are constructed.  The first turns
out to be the relationship "has the same length as", and the second
"has the same first element as".

(3)  After  testing  random  structures  to  see  if  they  have  the
SAME-LENGTH,  AM decides  to  try to  find a  canonical form  for all
structures.  This  would mean  that two structures  were of the  same
length if and only if their canonical forms were identically equal.

(4)  AM comes  up with  the  idea of  making this  conversion  by (i)
converting each structure to a  BAG, and (ii) replacing each  element
by the  constant T.   Thus the  canonical form  of all structures  of
length  10 is  a  BAG of  10  T's.   AM decides  to  restrict various
operations like Union and Intersection to such  canonical structures,
and see what happens.

The user  tells AM  that such canonical  structures should  be called
NUMBERS.

Well, now that you've seen what AM says, let's see how it does it.

********************************************************************

To  start off slow, let's examine the  ways that AM found examples of
things which were EQUAL.  The highest-priority job was  selected. It
said to fill in the EXAMPLES facet of the EQUAL concept.  To do that,
AM gathered heuristics by rippling outward.

One of these rules of thumb said to instantiate the definition of
the predicate EQUAL. Here is the recursive definition of EQUAL: <defn
of EQUAL slide>.  An easy  way to instantiate this, to satisfy  this,
would be  simply to  satisfy this  base step.   To  do that,  the two
arguments  must  be  identical atoms.  So  some  examples  like 
NIL=NIL are produced.
Or we could invert this recursive step, by plugging in here objects
already known to be equal. We already know that NIL=NIL; so if the CARs
of 2 structures are both NIL, and their CDRs are both NIL, then the
structures will be Equal. So an example like (List NIL) = (List NIL)
is generated.
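Using Python lists in place of LISP s-expressions, the two example-generating moves look like this (a sketch of the idea, not AM's code):

```python
def equal(x, y):
    # base step: two identical atoms are equal
    if not isinstance(x, list) or not isinstance(y, list):
        return x == y
    if x == [] or y == []:
        return x == y                       # NIL = NIL
    # recursive step: CARs equal AND CDRs equal
    return equal(x[0], y[0]) and equal(x[1:], y[1:])

# instantiate the base step: an atom is equal to itself
base_example = ("NIL", "NIL")

# invert the recursive step: wrap two known-equal objects to
# manufacture a new example, e.g. (List NIL) = (List NIL)
new_example = ([[]], [[]])
```

Running the definition "backwards" like this turns a recursive predicate into a generator of its own examples.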

An entirely  different approach is suggested by  a heuristic tacked onto the
Activity concept.  It says  that,  when filling  in examples  of  any
activity or predicate, we can randomly instantiate the arguments, run
the concept,  and observe the results.  So AM randomly picks pairs of
objects and sees if they  satisfy EQUAL. This is where  the empirical
data  comes from  that tells  AM how  rarely EQUAL  is  satisfied. We
already saw  the heuristic  that said  to consider  generalizing  the
definition of a predicate, if  very few examples can be found.   So a
new job  is added, which says to  generalize the definition of EQUAL.
Since it has a good reason, it soon rises to the top of the job-list.

********************************************************************

AM is now  trying to generalize the definition of  EQUAL.  AM's first
magic trick is turning equality into equal-length.  Look at this definition.
Two objects  are equal if they are identical atoms, or
if they are both lists  and their cars are  equal and their cdrs  are
equal.  This is the recursive definition that AM wants to generalize.
Notice that  it is composed of a base  step <point> and two recursive
calls which  must both  succeed.   In such  a case,  an  easy way  to
generalize is  to assume that one  of them will always  succeed. That
is, force  the predicate to recur successfully in only one direction,
either the CARs or the  CDRs, but not both.  In particular,  by doing
away with the test in  the CAR direction <defn OVERLAY slide>, we get
a predicate which counts down two lists, and returns T if and only if
they both become empty at the same time.   That is, iff they have the
same length.
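The generalization can be shown directly: delete the CAR-direction test from that recursive definition, and what remains is a same-length predicate (again a Python sketch, not AM's code):

```python
def same_length(x, y):
    """EQUAL with the CAR-direction recursion assumed to always succeed:
    count down both CDRs, and succeed iff both lists empty out together."""
    if not isinstance(x, list) or not isinstance(y, list):
        return x == y
    if x == [] or y == []:
        return x == y                 # T iff both became empty at once
    return same_length(x[1:], y[1:])  # no test in the CAR direction
```

The elements themselves are never examined, which is exactly why this predicate only sees length.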

The  second  magic  trick  was  deriving  the  canonical  form  of  a
structure.  AM did this by selective experimentation, which I'll describe
later if anyone asks.


This whole development was  done to satisfy just 3 jobs from the agenda
list:

.BEGIN FILL INDENT 4,0,0 PREFACE 0

(Fillin Exs. of Equality)

(Generalize Defn. of Equality)

(Canonize Same-length wrt Equality)

.END

.SELECT 1

	GOOD RUN

Everything that I've described  has been coded and debugged,  and all
the examples  run, but I always seem to have  20 pages of notes about
additions I have to make to AM,  and better ways to do things and  so
on.

Let me  describe the best run  so far, which occurred  last month, in
fact.  The system started with the 100 prenumerical concepts I showed you
earlier, and after 1
cpu hour,  it had derived  about 100 new ones. Half of those  were real
losers,  but  the  rest  were  interesting.    Some  deep  chains  of
discoveries were in fact  made.  <SLIDE chain>  Note it went  through
the whole chain of  discoveries I talked about during  this talk, all
the   way   from   set-theoretic  notions   to   numbers   to  unique
factorization. 

AM did this all on its own; if I sit at the keyboard and guide it along,
the same discoveries can be made in about a third the cpu time.

200 jobs from the  agenda were executed  during this period. 
Each  job was allocated from  1 to 100 cpu
seconds, depending  on its priority  value.   Most jobs were  granted
about 30 seconds, but used only 20 or 25.
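The talk only says that each job got between 1 and 100 cpu seconds depending on its priority; one plausible mapping is linear. This is purely hypothetical, including the 0-to-1000 priority scale:

```python
def cpu_quantum(priority, lo=1.0, hi=100.0, pmin=0.0, pmax=1000.0):
    """Map a priority value linearly onto [lo, hi] cpu seconds,
    clamping out-of-range priorities.  The 0..1000 priority scale
    is an assumption, not AM's actual scale."""
    frac = (priority - pmin) / (pmax - pmin)
    frac = max(0.0, min(1.0, frac))
    return lo + frac * (hi - lo)
```

A job that finishes early simply returns its unused seconds, which is why most jobs used less than their quantum.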

The 1-hour run ended because of space problems. I'm now going to save the
best of the derived concepts, throw away the rest, and re-start AM from that
point. I haven't tried that yet.

How general is AM?
I think that the basic design could be used to form theories in many
empirical sciences; the only problem is obtaining all the data that
AM would ask for each time it proposed an experiment, for example.
That would be trivial in most formalizable fields, like physics or
other domains of mathematics. It would also be easy for Programming,
where doing an experiment would mean doing an EVAL. It would be ridiculous
in softer fields like sociology, where doing an experiment entails
months or years of effort.

In any domain, including the simple one I chose, the major problem is going
to be providing heuristics which are powerful enough
to recognize quickly whether each new concept is going to be
a dead end.

The most exciting aspect of the project is the ability to perform
experiments on the system: modify the starting concepts or the
heuristics, and observe the results.  This would be using AM as a
tool, to explore the impact of various pieces of knowledge.  We're
beginning such experiments with AM this month.

	MAXIMALLY-DIVISIBLE NUMBERS

After  spending a year  convincing my  reading
committee not to expect any new mathematical knowledge to emerge, one
brand new  mathematical theory was accidentally  discovered. 
I'll close this talk by indulging
my maternal pride, and
show you one slide about it. <OVERLAY: Maxdivis>

Just as AM decided to look at numbers with extremely ⊗4few⊗* divisors,
it thought to look at those with abnormally ⊗4many⊗* divisors.
<SLIDE  AM conjec> For  any number d,  the
smallest number with  at least d divisors is of this form, where these
exponents decrease inversely as the logs of the primes.  For example,
here's the smallest number with  4 million divisors, and in  fact the
exponents of its prime  factors satisfy this constraint.  Although it
required a couple of humans to precisely state and prove this
constraint, it was AM which suggested what to look for.
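The conjecture is easy to explore by brute force for small cases. Here is a sketch that finds the smallest number with at least d divisors and checks that the exponents in its factorization are nonincreasing; the full inverse-log constraint on the exponents needs the precise statement the humans supplied:

```python
def num_divisors(n):
    """Count divisors of n via its prime factorization:
    if n = p1^e1 * ... * pk^ek, the count is (e1+1)*...*(ek+1)."""
    count, p = 1, 2
    while p * p <= n:
        e = 0
        while n % p == 0:
            n //= p
            e += 1
        count *= e + 1
        p += 1
    if n > 1:
        count *= 2          # one leftover prime factor, exponent 1
    return count

def smallest_with_at_least(d):
    """Brute force: the least n whose divisor count is >= d."""
    n = 1
    while num_divisors(n) < d:
        n += 1
    return n

def exponents(n):
    """Exponents of n's prime factorization, in order of increasing prime."""
    exps, p = [], 2
    while n > 1:
        e = 0
        while n % p == 0:
            n //= p
            e += 1
        if e:
            exps.append(e)
        p += 1
    return exps
```

For instance, the smallest number with at least 12 divisors is 60 = 2↑2 x 3 x 5, and its exponents 2, 1, 1 do fall off as the primes grow, just as the conjecture predicts.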

In fact, I  happened to mention to a  friend of mine that  the system
was wasting a lot  of time in this worthless area, but that it always
had a good reason for what it was doing. He suggested that maybe this
concept ⊗4was⊗* interesting, and, as in the old Hilbert joke, after a
day or two of thought I concluded it ⊗4was⊗* intuitively interesting.

Ideally, one could run AM overnight, and
wake up the next morning to find five or ten more or less interesting
new concepts to look at.   It would still be the  researcher who made
the  highest-level  selections  of  what  to  do  next, much  like  a
professor directing graduate students.
This  is the ultimate use that I hope  such systems will be
put to -- probably in the next century: 
to  work alongside a researcher,  to free him from  doing all
the routine detective work.

.SKIP TO COLUMN 1
**********************************************************************

SUPPLEMENTARY: Finding canonizing function for numbers

Experiment number 1  was to  convert the arguments  from one type  of
structure  to  another,  and   see  if  that  changed  the  value  of
SAME-LENGTH when it was applied to them.  It turned out not to, which
meant that one single canonical type of structure could be chosen.

But there  are 4  kinds it  could be:  a set,  a list, a  bag, or  an
ordered set.   The particular kind of  structure depends on 2 things:
whether altering the order of the elements in the arguments makes any
difference to SAME-LENGTH, and whether adding multiple occurrences of
the same element makes any difference to SAME-LENGTH.

But  changing  the  order  makes  no  difference,  so  the  canonical
structure type  must be  unordered, either BAG  or SET.   But  adding
multiple elements ⊗4does⊗* make a difference, so the type must be BAG
or LIST.  The only possible canonical type is therefore BAG.

Next, AM notices  a message in  the Ties facet  of SAME-LENGTH,  that
explicitly  says  that  it's  the  same   as  Equality,  except  that
SAME-LENGTH does not recurse in the CAR direction. The CAR of an atom
is its  value cell, so  that message  is enough  evidence to  warrant
testing whether or  not the value cells of any  of these elements are
ever  accessed.  Empirically,  they never  are. So AM  assumes it can
replace them all by a single, fixed value: say "T".

Putting this all together,  we see the canonizing  transformation as:
convert the structure to a BAG, and substitute T for each element.
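Putting that transformation in code (a sketch; Python's Counter plays the role of a BAG):

```python
from collections import Counter

def canonize(structure):
    """Convert any structure to a BAG and substitute the fixed token T
    for each element.  The canonical form of any structure of
    length n is therefore a bag of n T's."""
    return Counter("T" for _ in structure)

def same_length(x, y):
    # two structures have the same length if and only if
    # their canonical forms are identically equal
    return canonize(x) == canonize(y)
```

The bag forgets order, and the substitution of T forgets the elements themselves, so the only information that survives canonization is the count.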

**********************************************************************

Another thought I hope you'll leave with today is that (for this task
at least) if the heuristic operators are sophisticated enough, then a
simple numeric calculus is all the meta-level heuristics you need for
adequate   performance.      And   you   don't   need   any  explicit
meta-meta-heuristics at all.  It would be interesting to  see if this
is really true for other tasks as well.  I suspect that this is true
whenever the heuristics are fairly independent of each other.

One problem we all  face is how  to cope with  changes in a  system's
knowledge  base. The  standard  solution is  to  track down  all  the
effects of each change, and update the whole data base.  The solution
AM uses,  and  perhaps people  too, is  to  just record  the  changed
information, and ⊗4not⊗* to update  everything.  When a contradiction
of  some kind occurs, then the  conflicting knowledge is checked, the
knowledge ⊗4it⊗*  was  based  on is  checked,  and  so on,  until  we
encounter some  changed opinion which explains the  disagreement.  So
the system is allowed to hold contradictory viewpoints, so long as it
could resolve any  dispute if it  had to.   This is one  solution you
might  keep in mind  for your  knowledge-based systems. I  think that
⊗4this⊗* works because there are so few contradictions; that  in turn
is due to the independence of the reasoning involved.
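A toy sketch of that lazy bookkeeping, with all names invented: each belief records its supports, nothing is updated when an opinion changes, and a conflict is resolved only on demand, by walking back through supports until a revised premise is found.

```python
beliefs = {}   # name -> {"value": ..., "supports": [...]}

def believe(name, value, supports=()):
    """Record a belief together with the beliefs it was based on."""
    beliefs[name] = {"value": value, "supports": list(supports)}

def changed_ancestor(name, changelog):
    """Chase the chain of supports behind a belief until we hit
    some changed opinion that explains a disagreement; return None
    if no changed premise is found."""
    stack = [name]
    while stack:
        b = stack.pop()
        if b in changelog:
            return b
        stack.extend(beliefs.get(b, {}).get("supports", []))
    return None

believe("A", 1)
believe("B", 2, ["A"])
believe("C", 3, ["B"])
```

Until a contradiction actually surfaces, the revised premise A and its stale consequences B and C coexist peacefully; the system pays for consistency only when a dispute forces it to.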